support indexing request WARC records #82

Merged: 7 commits, Nov 7, 2024

ikreymer (Member) commented Nov 6, 2024

  • support the --fields option for the cdx-index command
  • support req.* fields, which apply only to request records (both WARC and HTTP headers); other headers apply to the response/main record
  • support referrer as a special shortcut for req.http:referer
  • tests: update tests to include 'req.http:cookie' in cdx
  • tests: update tests to include 'referrer' in cdx
  • compatibility with Python cdxj-indexer
  • version: bump to 2.4.0
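
The field rules in this list can be sketched as a small lookup. This is a minimal illustration, not the code from this PR: `HeaderMap`, `RecordHeaders`, `RecordPair`, `lookupHeader`, and `resolveField` are hypothetical names, and the sketch assumes an `http:` prefix selects an HTTP header while an unprefixed name selects a WARC header.

```typescript
// Hypothetical shapes for illustration only; not the library's real types.
type HeaderMap = Record<string, string>; // keys stored lowercased

interface RecordHeaders {
  warc: HeaderMap; // WARC record headers
  http: HeaderMap; // HTTP headers carried in the record payload
}

interface RecordPair {
  response: RecordHeaders; // response/main record
  request?: RecordHeaders; // matching request record, if one was seen
}

const REQ_PREFIX = "req.";
const HTTP_PREFIX = "http:";

// Case-insensitive header lookup that ignores inherited object properties.
function lookupHeader(headers: HeaderMap, headerName: string): string | undefined {
  const key = headerName.toLowerCase();
  return Object.prototype.hasOwnProperty.call(headers, key) ? headers[key] : undefined;
}

function resolveField(field: string, pair: RecordPair): string | undefined {
  // 'referrer' is treated as a shortcut for 'req.http:referer'.
  let fieldName = field === "referrer" ? "req.http:referer" : field;

  // 'req.*' fields read from the request record; everything else
  // reads from the response/main record.
  let source: RecordHeaders | undefined = pair.response;
  if (fieldName.slice(0, REQ_PREFIX.length) === REQ_PREFIX) {
    fieldName = fieldName.slice(REQ_PREFIX.length);
    source = pair.request;
  }
  if (source === undefined) {
    return undefined; // no request record available for a req.* field
  }

  // Assumption: 'http:' selects an HTTP header, otherwise a WARC header.
  if (fieldName.slice(0, HTTP_PREFIX.length) === HTTP_PREFIX) {
    return lookupHeader(source.http, fieldName.slice(HTTP_PREFIX.length));
  }
  return lookupHeader(source.warc, fieldName);
}
```

Under these assumptions, `resolveField("referrer", pair)` and `resolveField("req.http:referer", pair)` return the same value, and any `req.*` field yields `undefined` when no request record is present.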

- support customizing --fields for cdxj indexing
- support 'req.*' fields which only apply to request records, other headers apply to response/main record
- support 'referrer' as special shortcut for 'req.http:referer'
- tests: update tests to include 'req.http:cookie' in cdx
- tests: update tests to include 'referrer' in cdx

version: bump to 2.4.0
@ikreymer ikreymer requested a review from tw4l November 6, 2024 01:41
ikreymer (Member Author) commented Nov 6, 2024

Released 2.4.0-beta.0, tested with crawler, appears to be working as intended.

tw4l (Member) left a comment

Thanks for the tests

})
.option("fields", {
alias: "f",
describe: "fields to include in index",
tw4l (Member) commented:

Suggested change
- describe: "fields to include in index",
+ describe: "comma-separated list of fields to include in index",

Since we're not using yargs' array type, it might be good to be explicit about the expected format of the input.

ikreymer (Member Author) replied:

Hm, maybe should use the array type, will look.

ikreymer (Member Author) commented Nov 7, 2024

Also added type safety for CLI args and made them arrays with defaults.
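
To illustrate why the input format matters, here is a rough sketch of normalizing a fields argument that may arrive either as one comma-separated string (a plain string option) or as an array of values (an array-typed option), falling back to defaults. `normalizeFields` and `EXAMPLE_DEFAULT_FIELDS` are hypothetical names, and the default list is a placeholder rather than the defaults used in this PR.

```typescript
// Placeholder defaults for illustration; the real default field list is not shown in this PR.
const EXAMPLE_DEFAULT_FIELDS: string[] = ["url", "mime", "status"];

// Accepts undefined, a comma-separated string, or an array whose
// entries may themselves be comma-separated, and returns a clean list.
function normalizeFields(input?: string | string[]): string[] {
  const parts: string[] =
    input === undefined ? [] : Array.isArray(input) ? input : [input];

  const fields: string[] = [];
  for (const part of parts) {
    for (const piece of part.split(",")) {
      const field = piece.trim();
      if (field.length > 0) {
        fields.push(field); // skip empty entries such as "a,,b"
      }
    }
  }

  // Fall back to a copy of the defaults when nothing usable was given.
  return fields.length > 0 ? fields : EXAMPLE_DEFAULT_FIELDS.slice();
}
```

With this shape, the rest of the code always sees a `string[]`, whichever way the option was passed on the command line.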

@ikreymer ikreymer merged commit 1e53bbb into main Nov 7, 2024
10 checks passed
@ikreymer ikreymer deleted the cdx-index-http-request branch November 7, 2024 03:09